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Lysine acetylation is a reversible post-translational modification, playing an important role in cytokine 
signaling, transcriptional regulation, and apoptosis. To fully understand acetylation mechanisms, 
identification of substrates and specific acetylation sites is crucial. Experimental identification is often 
time-consuming and expensive. Alternative bioinformatics methods are cost-effective and can be used in a 
high-throughput manner to generate relatively precise predictions. Here we develop a method termed as 
SSPKA for species-specific lysine acetylation prediction, using random forest classifiers that combine 
sequence-derived and functional features with two-step feature selection. Feature importance analysis 
indicates functional features, applied for lysine acetylation site prediction for the first time, significantly 
improve the predictive performance. We apply the SSPKA model to screen the entire human proteome and 
identify many high-confidence putative substrates that are not previously identified. The results along with 
the implemented Java tool, serve as useful resources to elucidate the mechanism of lysine acetylation and 
facilitate hypothesis-driven experimental design and validation. 



Lysine acetylation is an important type of reversible post-translational modification (PTM) that takes place in 
the 8-amino group of lysine residues in proteins. Regulation of lysine acetylation is activated by a highly 
balanced enzyme system. In this system, lysine acetyltransferases (KATs) transfer the acetyl group to the 8- 
amino group of lysine, while lysine deacetylases (KDACs) or histone deacetylases (HDACs) remove these acetyl 
groups'. Around 50 years ago, lysine acetylation of nuclear histones was discovered'' followed by the successive 
identification of several acetylation sites in histones. Research over the past five years has shown that this 
reversible covalent modification is strongly related to cell regulation. During this period, more than 2,000 
proteins, including kinases, transcription factors, ubiquitin ligases, structural proteins, ribosomal proteins and 
metabolic enzymes, have been identified as acetylated, not only in histones but also in the cytoplasm of mam- 
malian cells'""^. These proteins are critical for a variety of cellular activities, ranging from the DNA damage 
checkpoint, cell cycle control, and cytoskeleton organization to metabolism and endocytosis. Lysine acetylation 
is crucial for both nuclear and cytoplasmic processes". Most major enzymes involved in the tricarboxylic acid 
cycle (TCA) cycle, nitrogen metabolism, fatty acid oxidation, urea cycle, glycolysis, gluconeogenesis and glycogen 
metabolism undergo lysine acetylation^. 

Our understanding of the regulatory roles of lysine acetylation remains nebulous. Identification of acetylation 
sites is an essential first step towards elucidation of the mechanism underlying protein acetylation. A number of 
experimental methods have been accordingly developed to determine potential acetylation sites, including the 
radioactive chemical method', mass spectrometry'", and chromatin immunoprecipitation (ChIP)". However, 
these conventional experimental techniques are laborious, time-consuming and usually expensive'^. Several high- 
throughput experimental methods such as mass spectrometry-based proteomics also provide a better and larger 
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Figure 1 | Sequence logo illustration generated by IceLogo to show the occurrences of amino acid residue types surrounding the acetylation sites for six 
different species, including H. sapiens, M. musculus, E. coli, S. typhimurium, S. cerevisiae and R. norvegicus. 



coverage of proteome-wide acetylation sites". As an alternative 
approach, computational prediction methods are more efficient 
and applicable for large-scale high-throughput screening of novel 
acetylation substrates. 

A variety of computational approaches have been developed to 
predict lysine acetylation sites'^'""^'*. For example, Basu et al}^ 
developed a prediction method, PredMod, that combines experi- 
mental approaches with clustering analysis to predict protein acet- 
ylation, based on the characteristics of residues surrounding 
acetylated lysines. Clustering of sequences in histones and nonhis- 
tones was used to represent a local amino acid sequence composition. 
Xu and colleagues'" developed a novel approach, Ensemble-Pail, 
which implemented an ensemble support vector machine (SVM) 
classifier with encoded features based on positional weight matrices 
(PWMs). A two-stage SVM-based classifier, N-Ace, proposed by Lee 
et al", was applied to identify protein acetylation sites based on 
features combining the physicochemical properties of proteins with 
accessible surface area. Suo et al}'' developed a position-specific 
SVM-based method, PSKAcePred, with features that included 
information on amino acid composition, evolutionary similarity 
and physicochemical properties to predict lysine acetylation sites. 



In their study, entropy values were used to select or exclude residues 
around the acetylation sites. Although significant progress has been 
achieved in predicting acetylation sites, the existing methods have 
certain drawbacks: (i) The regulation mechanism of lysine acetyla- 
tion differs among species, especially between prokaryotes and 
eukaryotes^^. Therefore, sequences or structural patterns around 
the acetylation sites may significantly differ in different organisms. 
However, the majority of existing studies disregarded the differences 
between species by considering all species-specific acetylation sites as 
generic sites to build a simplified model; (ii) Most existing models are 
established using machine learning techniques, such as SVM. 
However, not all features are equivalently important for the perform- 
ance of the trained model; redundant features will reduce the per- 
formance of the model. Accordingly, feature selection is generally 
required for removing redundant features and improving prediction 
performance. However, limited studies have involved this procedure 
to gain insights into the relative significance and contributory effects 
of various features; (iii) Most earlier studies only extracted features 
based on the sequence environment around the acetylated lysine, but 
failed to consider those descriptive of the whole protein that play a 
decisive role in determining the fate of a protein in terms of lysine 
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Figure 2 | mRMR results of the top 50 features (classified by feature type) for H. sapiens. Each group of features is denoted by different colors. See 
Supplementary Table SI for the fuU list. 



acetylation, especially for those involved in different cell processes. 
The next generation of computational methods thus needs to address 
the above drawbacks in order to generate more accurate models for 
the efficient identification of species-specific lysine acetylation sites. 

Here, we present a novel approach to predict species-specific 
lysine acetylation sites, based on the random forest (RF) algorithm, 
termed SSPKA (Species-Specific Prediction of lysine (K) 
Acetylation). In particular, our method incorporates various inform- 
ative features, including sequence-derived features, predicted sec- 
ondary structure and relevant functional features at both amino 
acid residue and protein levels, coupled with a two-step effective 
feature selection method, to assemble an optimal feature set for 
building the prediction model. SSPKA is benchmarked with other 
existing methods using both 5-fold cross-validation and independent 
tests. A user-friendly web server and the local Java tool of SSPKA are 
freely accessible at http://www.structbioinfor.org/Lab/SSPKA for the 
wider scientific community. A flowchart of the developed SSPKA 
approach is given in Supplementary Fig. SI. 

Results 

Analysis of sequence-level determinants of acetylation site speci- 
ficity. Based on the curated datasets, we analyzed the sequence 
surrounding the lysine acetylation sites and plotted a sequence 
logo (Fig. 1) for the six different species using IceLogo^'', with the 
aim of identifying distinct patterns or conserved sequence motifs 
between acetylation and background sites^*". 

The sequence logo indicates the existence of distinct sequence 
patterns between the six species. In Fig. 1, the large 'K' represents 
the centered acetylation site. Apparently, a primary feature of the site 
specificity across all sfac species is the requirement that other lysine 
residues are located proximal to the centered acetylation site. In 
particular, 'K' is preferred at position +4 in all six species, and this 
is more pronounced in H. sapiens and S. typhimurium, where 'K' 
tends to appear across all positions following the centered acetylation 
site, i.e., from +1 to +5. In addition, 'K' is favored at positions —5, 
—4 and — 1 in S. cerevisiae and S. typhimurium. Other residue types 
in addition to 'K' are also observed. One example is residue 'R', 
favored at position -HI in H. sapiens and M. musculus, —1 in S. 
cerevisiae, and —1 and — 2 in R. norvegicus, respectively. Residue 
'G', as another example, is favored at position — 1 in eukaryotes such 
as H. sapiens, M. musculus and S. cerevisiae. On the other hand, we 



observe several residues that are disfavored using IceLogo. For 
instance, residue 'L' is disfavored at positions —6, —4, 3, 5 and 6 in 
E. coli, and position —4 in R. norvegicus, respectively, whereas res- 
idue 'D' is not favored at position —2 in S. cerevisiae. Altogether, 
these results highlight the necessity and significance of addressing 
the task of precise lysine acetylation site recognition by developing 
species-specific predictors. 

Two-step feature selection via random forest. As heterogeneous 
features are often noisy and redundant''^, leading to an adverse 
impact on model training, such as decreasing performance, we 
performed feature selection to remove redundant features and 
assess those important for prediction performance. In particular, a 
two-step feature selection method was applied in our study. We have 
recently applied this two-step feature selection approach to address 
the task of protease-specific cleavage target prediction""*. 

In the first-step feature selection, we estimated the relative import- 
ance of each feature using the minimum-redundancy maximum- 
relevance (mRMR)^' approach, which ranked each input feature 
according to its relevance to the classification variable as well as 
redundancy among all features. Features that are ranked highly by 
mRMR generally have an appropriate balance between the max- 
imum relevance and minimum redundancy. This step is based on 
the benchmark dataset. We obtained the top 100 features as the 
optimal candidates (OFCs) after this step. As observed from Fig. 2 
and Supplementary Fig. S2, functional features had the highest rank- 
ing of importance scores. AAindex and PSSM features additionally 
had a relatively high importance score. The predicted secondary 
structures, such as those predicted by SABLE and SpineX, had rela- 
tively lower ranking values. These results indicate that the contribu- 
tive features in our method are predominantly sequence-derived and 
functional. 

The second step was a stepwise feature selection, i.e. incremental 
feature selection (IFS) based on the RF classifier. At each round of 
stepwise feature selection, the next feature from the mRMR-ranked 
feature list was added to the model, and the resulting performance of 
the model calculated. To evaluate the performance, 5-fold cross- 
validation tests were applied, whereby the benchmark dataset was 
randomly divided into five subsets. Each subset in turn was used as a 
holdout set. For each holdout set the remaining four subsets were 
merged to form the training set for the RF model, while the holdout 
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Figure 3 | IFS (Incremental Feature Selection) curves of acetylation site prediction for H. sapiens, M. musculus, E. coli, S. typhimurium, S. cerevisiae 
and R. norvegicus, respectively. 



set was used as the testing set for validation of the model. The cor- 
responding feature subset with which the RF classifier achieved the 
highest AUG score was considered the final optimal feature subset 
and used to buUd the prediction model. By iteratively adding inform- 
ative features from the initial OFCs, the prediction performance of 
the model was gradually increased during this procedure (Fig. 3). The 
best performance in Fig. 3 is the final AUG for the corresponding 



species. And the feature set for that performance was the final 
optimal feature set. At the same time the corresponding model was 
the final model based on the benchmark datasets. Fig. 3 shows the 
whole process of second feature selection. 

This two-step feature selection, which combines mRMR feature 
ranking and stepwise feature selection, provides a practical approach 
for selecting a useful subset of informative features, and has been 



SCIENTIFIC REPORTS | 4:5765 | DOI: 1 0. 1 038/srep05765 



4 



adopted by other prediction tasks^""'" Finally, we obtained a more 
compact informative feature subset that improved the prediction 
performance of RF classifiers for each species (A complete list of 
the final selected optimal features for each species is provided in 
Supplementary Table SI). 

Feature importance and contribution. As mentioned previously, 
the optimal feature subset was selected with a two-step selection 
procedure. After this procedure, ten different feature types were 
retained in the respective optimal subsets for each species, 
including functional features, AAindex, functional annotation, 
physicochemical properties, PSSM, conservation score, Disopred, 
Sable and SpineX. The number of selected optimal features 
differed, depending on the species of interest. For example, H. 
sapiens had the largest subset of optimal features (a total of 58), 
while R. norvegicus had the smallest subset with only 8 features 
used to build its specific model. In addition, the number of 
AAindex features was greater than that of other features for all 
species, presumably because the proportion of initial AAindex 
features prior to selection is extremely high, relative to other 
features (7280 to 7973). In contrast, the numbers of 
physicochemical properties, conservation score. Disorder and 
Sable features were relatively small. More importantly, functional 
features (only 7 in total) were entirely selected for optimal subsets 
of H. sapiens and M. musculus, suggesting that protein functional 
features play significant roles in determining the prediction 
performance of the model. Moreover, both AAindex and PSSM 
features were included in the optimal feature subsets for all six 
species, indicating that sequence-derived features represent a 
critical factor in determining the predictive power of the model. 

Sequence-derived features have been extensively used in model 
training and reported as crucial for acetylation site prediction in a 
number of previous studies. In our investigation, sequence-derived 
features, such as the PSSM profile, AAindex and physicochemical 
properties, were found to be indispensable for improving the predic- 
tion performance of lysine acetylation sites across all six species. To 
our knowledge, functional features were applied for the first time to 
build accurate models for predicting lysine acetylation sites. They 
have contributed significantly to performance improvement of our 
model, along with other complementary feature sets (see 
"Comparison with other tools" section for details). 

Prediction performance was evaluated based on the AUG score. 
Firstly, we compared the mean values of the selected optimal features 
in positive and negative datasets using the statistical unpaired two- 
sample t-test to verify whether the two datasets were significantly 
different. The given P value was used to estimate statistical signifi- 
cance between the two datasets for a specific feature. Results are 
illustrated in Supplementary Fig. S3 and Fig. S4. For most selected 
optimal subset features, P values were lower than 0.01/« (where n is 
the number of tests performed, in this case, the number of features) 
according to the Bonferroni adjustment, indicating that the positive 
and negative datasets are significantly different from each other. This 
finding highlights the discriminative power of these features for 
prediction. 

We continued to evaluate the importance and individual contri- 
bution of each feature type to the prediction performance of the 
model. For each species, all features of the feature type were taken 
out of the optimal feature set in turn, and the remaining feature types 
used to build the corresponding model for predicting acetylation 
sites. The prediction performance of the resultant model was eval- 
uated using the AUG measure. A feature is considered to contribute 
significantly to performance if the AUG score of the model in its 
absence decreases considerably, compared with that of the original 
model buUt using all the optimal features, as presented in Fig. 4. 
Taking H. sapiens as an example, there are eight types of features 
in its model. The AUG score for functional features was the lowest 



and this suggests that when removing this type of features from all 
eight types of features, the performance would considerably decrease, 
implying that functional features make a more important contri- 
bution than other features for predicting the acetylation sites of H. 
sapiens. 

We additionally quantified the contribution of each specific fea- 
ture by examining the difference between the AUG score of the 
model to that using only the examined feature as input and the 
AUG score using all other optimal features but excluding that par- 
ticular feature as input. This analysis facilitated determination of the 
individual features that had contributed more significantly to the 
prediction performance of the model. Our results are presented in 
Supplementary Fig. S5. Taking H. sapiens as an example, the 7973rd 
feature was functional feature_7 (Protein-protein interaction score) 
(See Supplementary Table SI for a full list of all the final features). 
The red bar in Supplementary Fig. S5 denotes the AUG score of the 
model that was trained using this particular feature only, which was 
over 0.6, whereas the blue bar indicates the AUG of the model that 
was trained using all other optimal features but excluding this par- 
ticular feature, which was nearly 0.8. Altogether, these results indi- 
cate that this feature is relatively important. 

Prediction performance of SSPKA based on benchmark datasets. 

We evaluated the prediction performance of the SSPKA models 
based on the final optimal features, using 5 -fold cross-validation 
tests based on the benchmark datasets. The results are presented in 
Table 1 and Supplementary Table S2. RF models for all six species 
displayed relatively good performance with AUG scores ranging 
from 0.746 to 0.883. Among these species, the performance of the 
model for S. typhimurium was the worst with an AUG of 0.746, while 
that for R. norvegicus was optimal with an AUG of 0.883. The models 
trained using the selected optimal features for H. sapiens, M. 
musculus, E. coli and S. cerevisiae achieved AUG scores of 0.794, 
0.788, 0.782 and 0.795, respectively. 

A significant feature of this work, distinct from previous lysine 
acetylation site prediction studies, is the characterization and incorp- 
oration of statistically over-represented functional features by per- 
forming hypergeometric tests on the background protein datasets"". 
While previous studies mostly focused on the extraction of useful 
sequence or sequence-derived features, such as PSSM and AAindex, 
our current model took into consideration other important, relevant 
functional features that could be used in combination with sequence- 
derived features to improve accuracy. 

Indeed, several key functional features, including KEGG, GO GG, 
BP, MF and PPI, contributed significantly to improvement of lysine 
acetylation site prediction. Supplementary Tables S3-S8 provide 
complete lists of the significantly enriched functional feature terms 
with P < 0.01/« {n is the number of tests performed and in this case, 
the number of terms) according to the Bonferroni adjustment) for all 
six species. The enrichment or depletion of these functional features 
reflects a specific inclination or preference of the functional require- 
ments of different acetylated proteins, such as those for cellular 
compartments, related pathways, and protein-protein interactions. 
Consequently, inclusion and encoding of the informative functional 
features helped improve prediction performance. 

Here, we further characterized significantly enriched terms at the 
functional level. First, over-represented functional features were 
identified by hypergeometric tests. Significantly enriched functional 
feature terms for each species are shown in Supplementary Tables 
S3-S8 (only the top ten terms are listed). Taking H. sapiens as an 
example, there were 2233 enriched protein interaction partners in 
terms of PPI features. With regard to KEGG pathway features, acety- 
lated H. sapiens proteins were enriched in metabolic pathway terms, 
such as "Valine, leucine and isoleucine degradation", "Glycolysis/ 
Gluconeogenesis" and "Pyruvate metabolism". In addition, these 
proteins were enriched in certain disease pathway terms, including 
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"Viral carcinogenesis", "Systemic lupus erythematosus" and 
"Pathogenic Escherichia coli infection" (Supplementary Table S3). 
The results suggest that acetylated H. sapiens proteins are involved in 
metabolic processes related to disease pathways. Similar pathway 
terms were found in M. musculus (Supplementary Table S4). In 
terms of Biological Process (BP) terms, again, acetylated proteins 
were associated with transcriptional processes, including "trans- 
lational termination", "translational initiation", "translational 
elongation", "viral transcription" and "mRNA splicing, via spliceo- 
some" (Supplementary Table S3). In terms of Molecular Function 
(MF) terms, acetylated proteins were enriched in functions related to 
nucleotide binding, such as "RNA binding", "DNA binding", "chro- 
matin binding" and "nucleotide binding". 

Sequence-derived features are additionally useful for predicting 
lysine acetylation sites. As shown in Supplementary Table SI, the 
selected optimal features derived from sequences mainly include 
AAindex and PSSM features of residues at positions surrounding 
potential lysine acetylation sites. Supplementary Table S9 displays 
an overall statistical analysis of aU the selected optimal features for 
each species. We did not elaborate on these features in this section, 
since their utility has been established in previous studies. In addi- 



tion, the application of powerful feature selection techniques, such as 
those used in this study, allowed quantification of the relative 
importance and contribution of each feature type to lysine acetyla- 
tion site prediction. Our findings collectively provide critical insights 
into the key determinants of lysine acetylation sites at both sequence 
and functional levels. 

Comparison with other existing tools based on independent test 
datasets. Both 5-fold cross-validation and independent tests were 
conducted to compare the performance of our method with other 
previously published methods, including Phosida^', BRABSB^"*, 
PLMLA", LysAcet"", ensemblePail" and PSKAcePred". Phosida" 
and PSKAcePred"" used the binary encoding features of amino 
acids as input features of the model. BRABSB^'' was a SVM-based 
human-specific lysine acetylation predictor that was developed using 
a novel bi-relative adapted binomial score Bayes (BRABSB) feature 
extraction method. PLMLA", LysAcet^" and ensemblePaiP" utilized 
position-weighted matrix or position-weighted amino acid 
properties, similar to the PSSM profile, as part of the input features 
to build the models. Moreover, PLMLA" employed the secondary 
structure predicted by PSIPRED, while PSKAcePred"' combined 
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solvent accessible surface area and KNN scores to train the model. 
Our SSPKA method incorporated not only the sequence-derived 
features previously shown to be useful for prediction but also the 
over-represented protein functional features, which made a 
significant contribution to the prediction power of the model. 
Another major difference between our method and other 
techniques lies in the fact that all earlier tools built the prediction 
model using SVM, whereas our method used RF based on decision 
trees to train and build the model. 

We initially compared the performance of our method with other 
methods using the benchmark datasets based on 5-fold cross-valid- 
ation tests. ROC curves of all methods are shown in Fig. 5, which 
describe the true positive rate as a function of false positive rate for 
different trade-offs between the sensitivity and specificity. Our 
method clearly outperformed the other five techniques for all six 
species. AUG scores of 0.794, 0.788, 0.746, 0.782, 0.795 and 0.883 
were achieved for H. sapiens, M. musculus, S. typliimurium, E. coli, S. 
cerevisiae and R. norvegicus, respectively. These results indicate that 
our model provides a better predictive power than existing tools on 
benchmark datasets. 

We further performed two independent tests by considering two 
different situations of negative sample selection to further compare 
the performance of our method (the first situation is to randomly 
select negative samples from proteins that contain positive samples 
and proteins in the background, and the second is to select negative 
samples only from proteins that contain positive samples, excluding 
the background dataset). The corresponding ROG curves for these 
two situations are displayed in Supplementary Fig. S6 and S7, 
respectively. 

In the first situation, our methods stiU outperformed the majority 
of other methods for all the species examined. One exception was the 
case of R. norvegicus for which SSPKA achieved an AUG of 0.696, 
which was slightly lower than that of LysAcet (AUG = 0.729), indi- 
cating slightly weaker performance of our method. The second situ- 
ation represents a more difficult scenario for lysine acetylation site 
prediction, since both positive and negative samples were obtained 
from the same proteins, thus representing a subtler situation and 
confounding the prediction of positives and negatives. In the second 
situation, our method still outperformed others for three species, M. 



musculus, S. typhimurium and E. coli. However, our method per- 
formed worse than PLMLA for S. cerevisiae, LysAcet for R. norvegi- 
cus, and PLMLA and BRABSB for H. sapiens, respectively. The main 
reason is that the contribution of functional features in this situation 
became marginal, due to the selection of samples only from proteins 
containing both negative and positive sites, reducing the predictive 
power of models relying on global functional features. The perform- 
ance comparison results of our method with other existing methods 
based on different datasets are shown in Table 1 and Supplementary 
Table S2. To examine the statistical significance, we further per- 
formed pairwise t-test using the prediction outputs from different 
methods. The results showed that the performance differences 
between our method SSPKA and other methods were, in most cases, 
statistically significant (Supplementary Table S 1 0) . Finally, we would 
like to emphasize that the testing results on the independent tests 
were more reliable and hence should be given higher weights when 
evaluating the performance of different methods. 

Cross-species performance evaluation. To examine whether each of 
the species-specific models could perform better for its original 
species than other species, we performed cross-species 
performance evaluation for each of the models by testing and 
comparing its performance on all other species. Here, the model's 
performance was tested by independent test datasets. As shown in 
Supplementary Fig. S8, the original model consistently performed 
the best when being applied to predict acetylation sites for the 
original species than other species. For example, the human- 
specific model achieved an AUG score of 0.756. In contrast, it 
achieved lower AUG scores of 0.606-0.698 for predicting 
acetylation sites of the other five species. This is also the case for 
other species-specific models, which consistently achieved the higher 
AUG scores when being applied to their own species than other 
species. In summary, these results justify the necessity and 
importance of developing species-specific models to improve the 
prediction of lysine acetylation sites. 

Influence of the selection of negative datasets on the performance. 

As the selection of negative data might influence the final prediction 
performance of the model, we examine this aspect by re- training the 



SCIENTIFIC REPORTS | 4:5765 | DOI: 1 0. 1 038/srep05765 



7 



H. sapiens 



M. musculus 





S. cerevisiae R. norvegicus 




0.0 0.2 04 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 

False positive rate False positive rate 



Figure 5 Performance comparison between our method and other tools on 5-fold cross-validation test datasets. 

models with randomly selected negative data and testing the resulting corresponds to the original model, which achieved an AUG value 
model's performance on the independent test dataset. This procedure of 0.756, while all the other five blue curves correspond to the re- 
was repeated five times and we generated the corresponding ROC trained models with randomly generated negative data, with close 
curves in Supplementary Fig. S9. As can be seen, the red curve AUG values ranging from 0.759 to 0.765. Therefore, these results 
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show that the selection of negative datasets has little or minor 
influence on the performance of the model. 

Proteome-wide prediction and implementation of online web 
server and local Java applet of SSPKA. We have implemented an 
online web server and local Java tool of SSPKA to facilitate high- 
throughput in silico prediction of lysine acetylation sites, which can 
be freely accessed at http://www.structbioinfor.org/Lab/SSPKA/. 
After demonstrating SSPKA's ability to predict lysine acetylation 
sites using both benchmark and independent datasets, we applied 
our model to screen the entire human proteome (a total of 84,919 
proteins), with the aim of identifying high-confidence novel lysine 
acetylation sites that have been overlooked with other experimental 
techniques. To generate high-confidence prediction results, SSPKA 
models were trained using the final optimal features based on the 
whole training dataset. Proteins containing predicted lysine 
acetylation sites (adopting a prediction threshold of 0.517, which 
corresponds to the upper-left-most point in the ROC curve for 
human) were classified as high-confidence acetylated substrates. 
Consequently, our high-throughput in silico analysis identified 
17,464 acetylated substrates with 66,255 predicted lysine 
acetylation sites. A complete list of the predicted acetylation 
substrates with detailed annotations of substrate protein IDs, 
acetylated lysine positions and amino acid sequences is available 
on the SSPKA website (http://www.structbioinfor.org/Lab/SSPKA/). 
The proteome-wide predictions represent a valuable resource for 
experimental validation of novel human acetylation substrates and 
generation of usefiil hypotheses. 

The implemented online web server, local Java Applet and user 
instructions are also available at the website. In particular, an import- 
ant advantage of the Java applet is that it provides a user-friendly 
interface and allows high-throughput in silico screening analysis of 
putative acetylation substrates (see Supplementary Fig. SIO for a 
screenshot of the interface and an example output of the implemen- 
ted Java tool). 

Analysis of proteome-wide prediction results in the sense of 
mimicking the distribution of acetylation sites. We further 
analyzed the proteome-wide prediction results to examine whether 
there was a correlation in the number of predicted acetylation sites to 
protein sequence length and whether there was a correlation in the 
number of predicted acetylation sites to the number of lysines in the 
protein sequence. We thus plotted these two distributions in 
Supplementary Fig. Sll. We can see that the two distributions 
were not similar to each other, with varying correlation coefficients 
of less than 0.5, indicating no significant correlations between the 
two types of distributions. 

Discussion 

The performance of some competing methods (PSKAcePred and 
Phosida) seemed to be lower than their original published results. 
Here, we provide our explanations. Firstly, many machine learning 
algorithms work better on balanced datasets than on imbalanced 
datasets. In this study, the number of negative samples (i.e. non- 
acetylation sites) is much larger than that of positive samples (i.e. 
acetylation sites). A widely adopted strategy in model training is to 
select a balanced number of positives versus negatives with a ratio of 
1:1, 1:2 and 1 : 3. This strategy has been used by many tools in 
protein post-translational modification (PTM) site prediction, 
including acetylation site prediction (ref 18, 19, 20, 22); Secondly, 
use of different benchmarking datasets might lead to the lower per- 
formance of these methods. We selected the positive datasets with 
high credibility (according to PubmedMS2 or CstMS2 values from 
PhosphoSitePlus datasets). Proteins that were previously considered 
as non-acetylated might be acetylated under certain conditions. 
Thus, the benchmark datasets have to be updated in a timely manner 



to reflect the current annotation status of acetylated proteins; 
Thirdly, the negative datasets were selected not only from lysine 
residues excluding known lysine acetylation sites on the same pro- 
tein, but also from other lysine residues on non-acetylated proteins 
(proteins not shown to be acetylated to date), while other methods 
only selected negative sites on the same proteins. In the latter case, it 
would be difficult to collect a group of proteins that can be strictly 
regarded as non-acetylated proteins. In such case, methods that did 
not consider the background proteins relative to acetylated proteins 
would not perform well on the benchmark datasets that not only 
included acetylated proteins but also incorporated vast numbers of 
background proteins; Lastly, some tools (e.g. BRABSB) only pro- 
vided a valid model for H. sapiens. As it was trained using acetylation 
data only from H. sapiens, it might not work well when applied to 
predict the acetylation sites for other species. 

In this study, we have developed a novel integrative approach, 
termed SSPKA, that has significantly improved the prediction per- 
formance of species-specific lysine acetylation sites across six differ- 
ent species, i.e., H. sapiens, M. musculus, S. typhimurium, E. coli, S. 
cerevisiae and R. norvegicus, by combining a variety of sequence- 
derived and functional features from multiple sources. SSPKA 
employs an efficient two-step feature selection framework to char- 
acterize the sequence and function-level features that are significant 
and relevant for the determination of true acetylation sites. 
Benchmarking experiments indicate that SSPKA is able to perform 
competitively, compared with existing tools. Moreover, a user- 
friendly webserver and local java program that suit the purposes of 
various biological users for the high-throughput in silico prediction 
of lysine acetylation substrates and sites have been made freely avail- 
able. We anticipate that SSPKA will be used as a powerful tool for 
hypothesis-driven experimental studies on novel acetylation sub- 
strates and their biological functions. 

Methods 

Datasets. Annotations of lysine acetylation sites were extracted from multiple public 
resources. These include CPLA'* (http://cpla.biocuckoo.org/), N-ACE" (http://N-Ace. 
mbc.NCTU.edu.tw/), Phosida" (http://www.phosida.com/), ASEB" (http://cmbi. 
bjmu.edu.cn/huac) and PhosphoSitePlus'"' (http;//www.phosphosite.org). Amongst 
these, CPLA is a lysine acetylation database that integrates abundant protein 
annotations, while PhosphoSitePlus and Phosida are comprehensive databases for 
post-translational modifications, including lysine acetylation data. N-ACE and ASEB 
are two lysine acetylation prediction tools that provide training datasets for their 
model^^ '^. To extract annotations from UniProtKJi/Swiss-Prot''^, all protein data were 
mapped to the UniProt database to retrieve the corresponding Uniprot IDs. After 
removing all identical sequences among the seven initial databases, we finally collected 
27,075 experimentally verified acetylation sites from 10,713 protein sequences. Large 
amounts of protein acetylation data resulted from the rapid development of high- 
throughput proteomic technologies. However, in many cases, the probability scores of 
MS2 (mass spectrum) were substantially low, indicating a low likelihood of true 
acetylation sites. Accordingly, we removed acetylation sites from our datasets with 
PubmedMS2 or CstMS2 values smaller than 10 in PhosphoSitePlus. Acetylation sites 
were grouped according to the corresponding species. As homologous sequences lead 
to overestimation of the prediction accuracy of built models, we clustered protein 
sequences at the 30% identity level using CD-HIT"*** software. Finally, species 
containing more than 40 acetylation sites were included in our positive datasets, 
predominantly because datasets of fewer samples are not sufficiently large to generate 
a valid machine learning model. 

Table 2 shows the statistics of the fmal species-specific datasets curated. In total, 
our final datasets contained 1,936 proteins with 3,956 acetylation sites from six 
species, including both prokaryotes and eukaryotes. The datasets can be downloaded 
at the website http;//www.structbioinfor.org/Lab/SSPKA. Negative samples were 
randomly selected, not only from lysine residues (excluding known lysine acetylation 
sites on the same protein) but also other lysines of non-acetylated proteins (proteins 
that were not shown to be acetylated to date), with a ratio of 1 : 2 of positive versus 
negative sites (i.e., random sampling of one positive sample accompanied with one 
negative sample on the same protein, as well as one negative sample from non- 
acetylated protein). In addition, 20% of each final dataset was randomly singled out as 
an independent test dataset to evaluate and compare performance between our 
method and other previously published protocols, while the rest was used as the 
training dataset to optimize the parameters, train the models and assess performance 
in 5-fold cross-validation tests. 

Feature extraction. The extracted features were classified into four major categories: 
sequence-derived features, predicted secondary structures, functional annotations 
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Table 2 | Statistics of species-specific lysine acetylation datasets 
curated in this study, covering six species H. sapiens, M. musculus, 
E. coli, S. typbimurium, S. cerevisiae and R. norvegicus 

Species Acelylated proteins Acetylation sites 


H. sapiens 


1121 


2368 


M. musculus 


426 


935 


E. coli 


143 


246 


S. typhimurium 


182 


250 


S. cerevisiae 


44 


86 


R. norvegicus 


20 


71 


Total 


1936 


3956 



and functional features. In keeping with earlier studies a local sliding window 
approach with 13 residues centered on the lysine of interest was employed to extract 
the sequence- derived features of each candidate residue. In total, 7,973 features from 
different feature types were extracted (Table 3). 

Sequence-derived features. Position- specific scoring matrix. Multiple sequence 
alignment containing the evolutionary information of a sequence in the form of 
position- specific scoring matrix (PSSM) has been shown to significantly improve 
prediction performance. Each element in PSSM indicates the probability of the 
individual residue at that specific position in the multiple sequence alignment^ The 
PSSM profile of each sequence was generated from P SI -BLAST'*", and a local sliding 
window approach adopted to encode the matrix of a given sequence fragment 
surrounding potential acetylation sites. The parameters for running PSI-BLAST were 
set as the default £-value cutoff, and three iterations used to search against the non- 
redundant NCBI NR database. 

AAindex. We employed the AAindex*' database to extract various biochemical and 
physicochemical properties of amino acids, which are major features. Three sections 
are included in the AAindex, specifically, AAindexl for the 20 numerical amino acid 
values, AAindex2 for the amino acid mutation matrix, and AAindex3 for statistical 
protein contact potential. 

Evolutionary conservation score. Evolutionary conservation is commonly employed 
as an important feature for prediction. A more conserved residue within a protein is 
indicative of higher importance for protein function. Evolutionary conservation 
features are extracted from the PSSM profile generated by PSI-BLAST. A lower 
conservation score means higher conservation at a specific position. 

The conservation score is defined as: 

20 

Scorei^ - ^pijlog^pij (1) 

7=1 

where pi j is the frequency of amino acid type j at position i. 



Predicted secondary structure. Protein secondary structure is a useful feature to 
predict lysine acetylation sites. However, due to the limited number of protein 
substrates with available structural information, we predicted the secondary structure 
from amino acid sequences using SABLE^^. For each residue of the query sequence, 
SABLE outputs three secondary structure types, H, E and C, denoting alpha-helix, 
beta-strand and coU, respectively. We encoded the predicted secondary structure in 
our model using 3-bit binary encoding^^. 

Predicted solvent accessibility. Solvent accessibility is another important feature for 
acetylation site prediction. SpineX^ was employed to predict the solvent accessibility 
information for each protein, which provided a quantitative score representing the 
extent of relative solvent accessibility of a residue from fully buried to fully exposed. 

Disordered region. A protein disordered region lacks a well-defined tertiary 
structure, and is either fully or partially unfolded. Earlier researchers suggested that 
disordered regions were 'useless'. However, over recent years, disordered regions have 
been shown to be involved in several important biological functions*^. For instance, 
many phosphorylation sites are located in disordered, rather than non-disordered 
regions^'*^. As such, disorder information contributes to phosphorylation site 
prediction*^. Here, we extracted predicted disorder information calculated using 
DISOPRED2, which was also added as the input to our models*^. 

Functional annotation. Functional annotation of a protein in UniProt can be found 
in the "FT" line of the annotation^^. Several different types of functional annotations 
were used as features, including DOMAIN, NP_BIND, DISULFID, MOD_RES, 
CARBOHYD, ACT_SITE, VARIANT, METAL, and BINDING, which represent 
domain, nucleotide binding, disulfide bond, post-translational modified residue 
(acetylation removed), glycosylation, active site, natural variant, metal ion binding 
site and binding site, respectively. Within the sliding window, the amino acid is 
encoded as "1" if that site has the annotation of a specific function. On the other hand, 
an amino acid without the functional annotation is encoded as "0". In total, there are 
13 (window size) X 9 (annotation types) — 117-dimensional encoded features for this 
type. 

Functional features. Inclusion of functional features of a whole protein and 
assessment of their contribution to performance is a crucial aspect of this work. To 
address this, we included protein functional features from the Gene Ontology 
database^** and other biological databases, including Biological Process features (BP), 
Cellular Component features (CC), Molecular Function features (ME), functional 
domain features from InterPro^', pathway features from KEGG^^, functional domain 
features from Pfam^^ and protein- protein interactions from PPP*. 

Random forest classifier. We employed a machine learning approach- random 
forest to generate models for lysine acetylation site prediction. RE is an ensemble 
learning method based on the classification tree^^, which "votes" for one of the two 
classes, either positive (acetylation sites) or negative (non- acetylation sites). The 
experimentally verified acetylation sites in the datasets were labeled ' 1 ' and all other 
lysine residues labeled '— 1'. As described above, the physicochemical properties of a 
lysine residue of interest were represented by a series of input feature vectors and 
encoded into RE classifiers to identify whether or not the residue was an acetylation 
site. RE is considered as one of the most accurate machine learning algorithms 



Table 3 | A summary of feature type, annotation and dimensionality. Features can be classified into four major categories: sequence- 
derived features, predicted secondary structure, functional annotation and functional features 

Feature type Annotation Dimensionality 



Sequence PSSM (PSI-BLAST) 260 

AAindex 7280 

Physicochemical properties of the whole protein 1 0 

Evolutionary conservation score 1 3 

Predicted secondary structure SABLE score 39 

DISOPRED score 26 

SpineX score 221 

Functional Annotation Domain 13 

Nucleotide binding 1 3 

Disulfide bond 13 

Posttranslational modified residue (acetylation is removed) 1 3 

Glycosylation 1 3 

Active site 1 3 

Natural variant 13 

Metal ion binding site 1 3 

Binding site 1 3 

Functional Features Gene ontology 3 

KEGG pathway 1 

Pfam 1 

InterPro 1 

Protein-protein interaction 1 
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available that produce highly accurate classification results. In addition, it can be used 
to select more important variables and efficiently handle large datasets^^. Owing to 
these advantages, RF has been frequently used to address classification problems in 
bioinformatics, such as prediction of DNA-binding sites, RNA-binding sites, residue- 
residue contacts, functional sites, disease-causing non-synonymous SNPs and metal- 
binding sites^"-^'-"-^". 

Performance evaluation. We used six measures, Matthews Correlation Coefficient 
(MCC), Accuracy (ACC), Sensitivity (SEN), Specificity (SPE), Precision (PRE) and 
Area Under the Receiver- Operating Characteristic Curve (AUC), to evaluate 
performance. For the six species- specific datasets, an under- sampling strategy with a 
1 : 2 ratio between positive and negative samples was adopted. It is not reasonable to 
assess performance using Accuracy (i.e., the proportion of true positives and true 
negatives on the dataset) based on an imbalanced dataset. AUC is the area under the 
receiver-operating characteristic (ROC) curve, presented as a plot of true positive rate 
(TPR i.e. SEN) against false positive rate (FPR). The AUC value of a ROC curve 
summarizes the overall performance of a corresponding model or method. An AUC 
value of 1.0 indicates perfect prediction, while 0.5 signifies complete random 
prediction. We consider AUC a more appropriate measure for comprehensively 
evaluating the overall quality of the RF-based classifier performance. 
MCC is defined as: 



MCC- 



TPxTN-FPxFN 



^/{ TP + FP)iTP + FN)( TN FP){ TN + FN) 



ACC is defined as: 



SEN is defined as; 



SPE is defined as: 



PRE is defined as: 



FPR is defined as; 



ACC-- 



TP + TN 



TP^TN + FP^FN 



SEN - TPR - TP /{TP -\- FN) 
SPE^TN/(TN + FP) 
PRE^TP/{TP^FP) 
FPR^FP/{TN + FP) 



(2) 

(3) 

(4) 
(5) 
(6) 
(7) 



where TP, TN, FP and FN represent the numbers of true positives, true negatives, false 
positives and false negatives, respectively. 
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